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Abstract —We describe a new visualization tool, dubbed 
HCMapper, that visually helps to compare a pair of dendrograms 
computed on the same dataset by displaying multiscale partition- 
based layered structures. The dendrograms are obtained by 
hierarchical clustering techniques whose output reflects some 
hypothesis on the data and HCMapper is speciflcally designed to 
grasp at first glance both whether the two compared hypotheses 
broadly agree and the data points on which they do not concur. 
Leveraging juxtaposition and explicit encodings, HCMapper 
focus on two selected partitions while displaying coarser ones 
in context areas for understanding multiscale structure and 
eventually switching the selected partitions. HCMapper utility 
is shown through the example of testing whether the prices of 
credit default swap financial time series only undergo correlation. 
This use case is detailed in the supplementary material at 
www.datagrapple.com/Tech as well as experiments with code on 
toy-datasets for reproducible research. HCMapper is currently 
released as a visualization tool on the DataGrapple time series 
and clustering analysis platform at www.datagrapple.com, 


Index Terms —Viewing algorithms; Model validation and anal¬ 
ysis; Hierarchical clustering; Time series analysis 

1. Introduction 

Hierarchical clustering is a pervasive quantitative technique 
used in many fields leveraging exploratory analysis such as 
audio m or image f2l signal processing, social sciences, 
natural language processing fa, El, finance from real estate 
la and stock O portfolios analysis for investment, hedging or 
risk management purposes, to labour division amongst analysts 
Q, and foremost in phylogenetics where the hierarchy, the 
Tree of Life, is the gist [0, ||9l. Thus, it is paramount to 
be able to understand and compare hierarchical clustering 
structures. To do so, many approaches are available: as a first 
step, one can look for an isomorphism between trees which is 
decidable in linear time in the number of vertices cni; then, 
one can try to quantify how dissimilar they may be using, for 


instance, editing distance m whose computation was shown 
to be an NP-complete problem, Robinson-Foulds distance Cal 
and its linear-time computation in the number of leaves El, 
or an index ifT^ extending the comparison of partitions ca 
to the whole hierarchies. However, the raw number yielded 
may be difficult to interpret, and does neither point out 
where differences appear nor what they consist in, failing to 
provide substantial information to the practitioner who may 
be interested by studying the cluster splits and the divergences 
between the two hierarchical structures. Yet the object of study, 
namely hierarchical organization, is also popular to visualize 
data CSl: it represents information at different levels of detail, 
and is amenable to interactive browsing, e.g. users can drill 
up and down the hierarchy. Much work has been done for the 
last two decades to effectively visualize trees: Cone Tree lfT7]| . 
Treemaps ca, El, Hyperbolic Tree l20l . and many more 
ED. In comparison, less works have been led for representing 
and comparing multiple trees 12^ despite its importance for 
field experts and practitioners who need to understand how 
dissimilar are clustering hierarchies produced by different clus¬ 
tering methods since these structural instabilities might point 
out objects of special interest. The Hierarchical Clustering 
Mapper aims at finding them: it highlights clustering singular¬ 
ities between two models. After reviewing related work, we 
explain in-depth design and features of the HCMapper before 
briefiy showing its usefulness for testing two hypotheses on 
financial time series and finally discuss its current limitations. 

II. Related work 

In the visualization literature, tree comparisons have been 
adressed by leveraging the display space. Two trees can be 
displayed side by side 1^ . one on top of the other 1^ . or 
the two facing each other 1^ . When dealing with (much) 
more than two trees to compare, 1261 and its extension IZTl 
consider a treespace by using a distance between trees, namely 
Robinson-Foulds metric, and project, using multidimensional 
scaling, hundreds of trees corresponding to phytogenies on 


the Euclidean plane. The purpose is to look for consensus 
between models depicted by clusters of tree-points on the 
plane. By hovering over a cluster, they display the consensus 
tree associated to it, but then for finer comparison of the poten¬ 
tially several consensus trees produced they just display them 
side by side coming back to the problem of pair comparison 
using MacClade or Mesquite softwares, or TreeJuxtaposer 1^ 
leveraging a focus-Fcontext approach to deal with large trees. 
Taking for granted that tree comparisons may require lots of 
display space, authors of 1^ developed an impressive tabletop 
framework to display trees under various representations and 
allow collaborative work of a team leaned on the tabletop to 
compare them. Others more interested in attributed trees, i.e. 
trees whose leaf nodes are associated with attributes, than in 
their topologies 1^ opt for innovative treemap layouts, still 
displaying them side by side, but claiming that users can better 
understand the changes in the hierarchy and layout, and notice 
more quickly the color and size differences describing changes 
in the attributes. Alternatively, some decide to superimpose 
the two trees to primarily deal with uncertainty on some part 
of the trees 1^ . but authors of CandidTree mention that it 
can also be applied to show differences between hierarchical 
structures by giving the exemple of backup directory structures 
to grasp which files or folders were deleted or moved. But once 
again, the focus is on the tree topology and not on underlying 
partition-based clustering. Designs mentioned fall into the gen¬ 
eral taxonomy of visual designs for comparison EqI, they in¬ 
volve juxtaposition or superposition. The remaining category, 
namely explicit encoding, has been used in tanglegrams ISTIl . 
1^ which are pair of trees whose leaf sets are in one-to-one 
correspondence and whose matching leaves are connected by 
inter-tree edges. Actual comparison is based on the measure¬ 
ment of minimum edge entanglements connecting the leaves 
from the two hierarchies, yet minimizing the entanglements, 
a two-sided crossing minimization problem (2SCM), is an 
NP-hard problem, and thus results are heuristics-dependent. 
An extension of the tanglegram is presented in GU focusing 
on comparing hierarchically organized data in the context of 
software systems by using an edge-bundling technique which 
highlight splits, joints and relocations of subhierarchies. The 
latter bears much resemblance to our work at first sight but 
our diverging goals have led to different designs, ours focusing 
more on consensus and outliers between two partitions at a 
time rather than providing a global overview of the differences 
and similarities between the two hierarchies in a single picture. 

III. Hierarchical Clustering Mapper explained 

In this section, we detail the design of HCMapper by first 
describing the mapping from input data to partition layers, and 
then how these layers are displayed to highlight moot points 
from the dataset. 

A. Construction and visualization layout 

We start from a dataset A’ = {xi,... ,Xn}. We obtain a 
dendrogram on A’ using a hierarchical clustering algorithm 
whose output reflects an hypothesis over A'. For comparing 


two hypotheses over A', we begin by building two such 
dendrograms. Then, we extract from each dendrogram all 
possible fiat partitions over A', thus transforming each one 
into a tree whose vertices at a given depth define a partition 
over A', partitions ranging from the coarsest one at the root to 
the finest one at the leaves as illustrated in Figure 



Eig. 1. Extracting flat partition-based clustering from a dendrogram and 
transforming it into a tree of clusters; all clusters at a given depth in this tree 
form a partition over the dataset X = {xi^X 2 ^x^^ X 4 ,^x^}. 

Bonds between the two tree vertices are explicitly encoded 
as links representing whether their associated cluster intersects 
or not. To avoid decontextualization using only explicit encod¬ 
ing design between two partitions, we also use juxtaposition 
by displaying coarser layers from each tree to understand the 
emergence of the two studied partitions, and a multiple view 
interactive system to switch the studied partitions. To avoid 
cluttering in this hybrid approach, we can activate a fisheye 
effect to read textual information if needed. Since cluster size 
is a significant information of clustering, we display it through 
its associated vertex whose size is proportional to the ratio of 
the cluster size over the dataset size. Even more important 
to our task, understanding clusters intersection: intersection 
size is encoded through the size of the edge that bonds them, 
wider is the edge relatively to the cluster size, stronger the 
consensus; intersection content can be displayed by hovering 
over the edge. To display this graph, we can leverage the 
D3.js |33l Sankey, a highly customizable chart, which is 
amenable for adding application-oriented features such as the 
ones recommended in El for visualizing dynamic hierarchies. 
Yet inspiring we think this work poorly handles outliers using 
a reporting tool which merely lists them in textual form. We 
believe that HCMapper is more suited to this specific task. 
Concerning time complexity for building the visualization 
from scratch on A' = {xi,..., it requires 0{n^ log n) for 
applying agglomerative hierarchical clustering algorithm, then 
for transforming the dendrograms obtained it costs O(n^), and 
finally for displaying V vertices 0{V^). 

B. HCMapper reading 

The HCMapper graph is depicted in Figure There are 
three main areas: two contexts, and a focus. The left and right 
contexts (Tree 1 and Tree 2 in the picture) represent some of 
the successive coarser layers of the left and right inner layers 
(colored layers in the image) to be compared in the focus 
area. The main goal is to understand how these two partitions 
are related. In Figure we compare a partition at depth n 
obtained from a dendrogram reflecting hypothesis 1, and a 





















partition at depth m obtained from the alternative dendrogram 
reflecting hypothesis 2. Looking at the left partition in the 
focus area, we can see that it is composed of three clusters 
(green, yellow, orange) of approximately equal size. The right 
partition consists in four clusters (light green, deep green, light 
orange and red). We can observe that the left green cluster 
is mapped onto the light and deep green clusters from the 
right partition which consist in a reflnement. However, right 
partition is not a strict reflnement of the left partition as it can 
be seen by looking at the right light orange cluster. Indeed, 
this one is composed of the left yellow cluster and a part of the 
left orange cluster, the latter evenly splitting into the right light 
orange cluster and the red one. But, what is the most important 
to notice here, and which is visually blatant, is the small edge 
diverging from the bulk of edges linking the left green cluster 
to the two right green clusters. This singular edge links an 
element from the left green cluster to the corresponding same 
labeled element yet belonging to the red cluster in the right 
partition. This actually can be considered as an outlier or a 
point of special interest since the two hypotheses broadly agree 
but on this point. Notice that flnding so readily this special 
point would be much harder by looking through clustering 
textual results or by inspecting common visual comparison 
between dendrograms such as the tanglegram displayed in 
Figure which is implemented in the dendextend R package. 


Hypothesis 1 Hypothesis 2 



Fig. 2. Two hypotheses are compared through the dendrograms which were 
transformed into two trees of partition-layers; note the diverging edge from 
the green cluster to the red one which highlights a moot point of special 
interest for experts. 

IV. Application to financial time series analysis 

Quantitative analysts and decision-makers have noticed that 
on some flnancial markets such as stocks, bonds and vanilla 
derivatives, time series of assets prices tend to cluster ac¬ 
cording to their pairwise correlation (61. But, is correlation 
really the only explanatory factor? By assuming a random walk 
model for prices, correlation and distribution amount for the 
whole information (35) in these time series. To answer this 
question, we state two hypotheses: there is only correlation 


in the data (HI); there is correlation and distribution in the 
data (H2). Then, we compute a dendrogram reflecting HI and 
another for H2 before building the HCMapper graph displayed 
in Figure At once, one can notice that clusters from HI 
hypothesis are broadly in a one-to-one correspondence with 
clusters from the H2 hypothesis, but a few outliers highlighted 
by thin diverging edges. Thus, thanks to the HCMapper 
visualization, we can conclude that correlation is the main 
explanatory factor for the clustering of prices time series 
since only few clusters are modifled by adding the distribution 
information. This can be explained by market microstructure: 
market makers are specialized and cover speciflc sectors, thus 
adding correlation between prices which are already influenced 
by common macroeconomic factors. Yet, the few moot points 
found are of paramount signiflcance for experts as they may 
correspond to assets whose price variations undergo heavy¬ 
tailed distribution or suffer from illiquidity, therefore particular 
attention should be given to these assets while performing a 
risk analysis. For a thorough treatment of this section, refer to 
(36), (371 and www.datagrapple.com/Tech 



Fig. 3. Left: two partitions extracted from a dendrogram for pure 
correlation hypothesis; Right: two partitions extracted from a dendrogram for 
dependence+distribution hypothesis. We can readily conclude that correlation 
is paramount for clustering prices. By hovering over the thin edge diverging 
from the “America” cluster in the pure correlation hypothesis, we can read 
AL (Rio Tinto) and CNG (AT&T) on the tooltip. These American compa¬ 
nies are clustered with high quality European government debt assets (e.g. 
Norway, Sweden, Denmark), all of them having a daily variation distribution 
characterizing illiquid products. 

V. Conclusions and future work 

HCMapper is an efficient tool for detecting moot points 
in a dataset with respect to several hypotheses described 
by different hierarchical clusterings. Its time complexity in 
0{n^ logn + V^), where n is the number of points and V the 
number of vertices to display, makes it unusable for very large 
datasets (cost of hierarchical clustering) or large display graphs 
(cost of positioning vertices). To our opinion, HCMapper 
is a promising tool for comparing dendrograms, provided it 
undergoes adaptations to fit to one’s needs. Concerning its 
evolution, we plan to extend HCMapper scope and ergonomy 
by using an information-geometric approach (3^ allowing for 








































Fig. 4. Left: two partitions extracted from a dendrogram for pure 
correlation hypothesis; Right: two partitions extracted from a dendrogram for 
dependence+distribution hypothesis; Consensus or divergence is much harder 
to grasp in the dendextend tanglegram (m than in HCMapper. 


multiple foci+contexts to simultaneously zoom-in on several 
interesting cluster splits. HCMapper is publicly available and 
in daily use: this visual tool was integrated in the DataGrapple 
project platform (www.datagrapple.com) where anyone can 
test its usefulness on credit default swaps historical time series 
data. We have released both implementation and use case, and 
hope it will be helpful for biologist researchers as well. 
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